
EUROPEAN PATENT APPLICATION

Application number: 93301433.4

Int. Cl.: H04B 1/10, H03M 7/02

Date of filing: 25.02.93

Priority: 02.03.92 US 844819

Date of publication of application:
06.10.93 Bulletin 93/40

Inventor: de Sousa Ferreira, Anibal Joao
Av. Zeferino de Oliveira 239
P-4560 Penafiel (PT)
Inventor: Johnston, James David
8 Valley View Road
Warren, New Jersey 07059 (US)

Designated Contracting States:
DE FR GB IT NL

Applicant: AMERICAN TELEPHONE AND TELEGRAPH COMPANY
32 Avenue of the Americas
New York, NY 10013-2412 (US)

Representative: Buckley, Christopher Simon Thirsk et al
AT & T (UK) LTD. 5 Mornington Road
Woodford Green, Essex IG8 0TU (GB)

A method and apparatus for the perceptual coding of audio signals.



A method and apparatus for performing a Modified Discrete Cosine Transform on an audio signal is
disclosed which utilizes a Discrete Fourier Transform. Illustratively, the MDCT spectral coefficients for
the signal are generated from the real FFT spectral coefficients.






EP 0 564 089 A1 



Cross-Reference to Related Applications and Materials

The following U.S. patent applications, filed concurrently with the present application and assigned to the
assignee of the present application, are related to the present application and each is hereby incorporated
herein as if set forth in its entirety: "RATE LOOP PROCESSOR FOR PERCEPTUAL ENCODER/DECODER," by
J.D. Johnston; "A METHOD AND APPARATUS FOR CODING AUDIO SIGNALS BASED ON PERCEPTUAL
MODEL," by J.D. Johnston; and "AN ENTROPY CODER," by J.D. Johnston and J.A. Reeds.

Field of the Invention 

The present invention relates to processing of information signals, and more particularly, to the efficient
encoding and decoding of monophonic and stereophonic audio signals, including signals representative of
voice and music information, for storage or transmission.

Background of the Invention

Consumer, industrial, studio and laboratory products for storing, processing and communicating high
quality audio signals are in great demand. For example, so-called compact disc ("CD") and digital audio tape
("DAT") recordings for music have largely replaced the long-popular phonograph record and cassette tape.
Likewise, recently available digital audio tape ("DAT") recordings promise to provide greater flexibility and high
storage density for high quality audio signals. See also Tan and Vermeulen, "Digital audio tape for data
storage", IEEE Spectrum, pp. 34-38 (Oct. 1989). A demand is also arising for broadcast applications of digital
technology that offer CD-like quality.

While these emerging digital techniques are capable of producing high quality signals, such performance
is often achieved only at the expense of considerable data storage capacity or transmission bandwidth.
Accordingly, much work has been done in an attempt to compress high quality audio signals for storage and
transmission.

Most of the prior work directed to compressing signals for transmission and storage has sought to reduce
the redundancies that the source of the signals places on the signal. Thus, such techniques as ADPCM,
sub-band coding and transform coding described, e.g., in N.S. Jayant and P. Noll, "Digital Coding of Waveforms,"
Prentice-Hall, Inc., 1984, have sought to eliminate redundancies that otherwise would exist in the source
signals.

In other approaches, the irrelevant information in source signals is sought to be eliminated using
techniques based on models of the human perceptual system. Such techniques are described, e.g., in E.F. Schroeder
and J.J. Platte, "'MSC': Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer
Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise
Criteria, Vol. 6, No. 2, IEEE J.S.C.A. (Feb. 1988).

Perceptual coding, as described, e.g., in the Johnston paper, relates to a technique for lowering required
bitrates (or reapportioning available bits) or total number of bits in representing audio signals. In this form of
coding, a masking threshold for unwanted signals is identified as a function of frequency of the desired signal.
Then, inter alia, the coarseness of quantizing used to represent a signal component of the desired signal is
selected such that the quantizing noise introduced by the coding does not rise above the noise threshold,
though it may be quite near this threshold. The introduced noise is therefore masked in the perception process.
While traditional signal-to-noise ratios for such perceptually coded signals may be relatively low, the quality
of these signals upon decoding, as perceived by a human listener, is nevertheless high.
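By way of illustration only, the following sketch chooses a uniform quantizer step so that the expected quantizing noise sits at a given masking threshold. It assumes the textbook approximation that a uniform quantizer of step size d introduces noise of power d*d/12. The function names and the single-threshold setup are assumptions made for this example and are not taken from the cited Johnston paper.

```python
import math

def step_for_noise_threshold(threshold_power):
    """Step size whose uniform-quantizer noise power (step**2 / 12) equals the threshold."""
    return math.sqrt(12.0 * threshold_power)

def quantize(value, step):
    """Map a value to the index of the nearest quantizer level."""
    return round(value / step)

def dequantize(index, step):
    """Reconstruct the value represented by a quantizer index."""
    return index * step

# Illustrative numbers only: a masking threshold of 0.03 (arbitrary power units).
step = step_for_noise_threshold(0.03)
reconstructed = dequantize(quantize(1.0, step), step)
```

A higher threshold yields a coarser step and therefore fewer distinct quantizer indices to encode, which is the mechanism by which the bit count is reduced.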

Brandenburg et al., U.S. Patent 5,040,217, issued August 13, 1991, describes a system for efficiently coding
and decoding high quality audio signals using such perceptual considerations. In particular, using a measure
of the "noise-like" or "tone-like" quality of the input signals, the embodiments described in the latter system
provide a very efficient coding for monophonic audio signals.

It is, of course, important that the coding techniques used to compress audio signals do not themselves
introduce offensive components or artifacts. This is especially important when coding stereophonic audio
information where coded information corresponding to one stereo channel, when decoded for reproduction, can
interfere or interact with coding information corresponding to the other stereo channel. Implementation choices
for coding two stereo channels include so-called "dual mono" coders using two independent coders operating
at fixed bit rates. By contrast, "joint mono" coders use two monophonic coders but share one combined bit rate,
i.e., the bit rate for the two coders is constrained to be less than or equal to a fixed rate, but trade-offs can
be made between the bit rates for individual coders. "Joint stereo" coders are those that attempt to use
interchannel properties for the stereo pair for realizing additional coding gain.

It has been found that the independent coding of the two channels of a stereo pair, especially at low
bit-rates, can lead to a number of undesirable psychoacoustic artifacts. Among them are those related to the
localization of coding noise that does not match the localization of the dynamically imaged signal. Thus the
human stereophonic perception process appears to add constraints to the encoding process if such mismatched
localization is to be avoided. This finding is consistent with reports on binaural masking-level differences that
appear to exist, at least for low frequencies, such that noise may be isolated spatially. Such binaural masking-
level differences are considered to unmask a noise component that would be masked in a monophonic system.
See, for example, B.C.J. Moore, "An Introduction to the Psychology of Hearing," Second Edition, especially
chapter 5, Academic Press, Orlando, FL, 1982.

One technique for reducing psychoacoustic artifacts in the stereophonic context employs the ISO-WG11-
MPEG-Audio Psychoacoustic II [ISO] Model. In this model, a second limit of signal-to-noise ratio ("SNR") is
applied to signal-to-noise ratios inside the psychoacoustic model. However, such additional SNR constraints
typically require the expenditure of additional channel capacity or (in storage applications) the use of additional
storage capacity, at low frequencies, while also degrading the monophonic performance of the coding.

Summary of the Invention 

Limitations of the prior art are overcome and a technical advance is made in a method and apparatus for
coding a stereo pair of high quality audio channels in accordance with aspects of the present invention.
Interchannel redundancy and irrelevancy are exploited to achieve lower bit-rates while maintaining high quality
reproduction after decoding. While particularly appropriate to stereophonic coding and decoding, the advantages
of the present invention may also be realized in conventional dual monophonic stereo coders.

An illustrative embodiment of the present invention employs a filter bank architecture using a Modified
Discrete Cosine Transform (MDCT). In order to code the full range of signals that may be presented to the system,
the illustrative embodiment advantageously uses both L/R (Left and Right) and M/S (Sum/Difference) coding,
switched in both frequency and time in a signal dependent fashion. A new stereophonic noise masking model
advantageously detects and avoids binaural artifacts in the coded stereophonic signal. Interchannel
redundancy is exploited to provide enhanced compression without degrading audio quality.

The time behavior of both Right and Left audio channels is advantageously accurately monitored and the
results used to control the temporal resolution of the coding process. Thus, in one aspect, an illustrative
embodiment of the present invention provides processing of input signals in terms of either a normal MDCT
window, or, when signal conditions indicate, shorter windows. Further, dynamic switching between RIGHT/LEFT
or SUM/DIFFERENCE coding modes is provided both in time and frequency to control unwanted binaural noise
localization, to prevent the need for overcoding of SUM/DIFFERENCE signals, and to maximize the global
coding gain.

A typical bitstream definition and rate control loop are described which provide useful flexibility in forming
the coder output. Interchannel irrelevancies are advantageously eliminated and stereophonic noise masking
improved, thereby to achieve improved reproduced audio quality in jointly coded stereophonic pairs. The rate
control method used in an illustrative embodiment uses an interpolation between absolute thresholds and
masking threshold for signals below the rate-limit of the coder, and a threshold elevation strategy under
rate-limited conditions.
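By way of illustration only, one possible reading of the interpolation-based rate control is sketched below. The sketch steps an interpolation weight from the absolute-threshold setting toward the masking-threshold setting until a bit estimate fits the budget. The quantizer step sizes, the linear interpolation, the fixed number of trial weights and the crude `estimate_bits` cost are all assumptions made for this example; the actual procedure is the subject of the incorporated rate loop application and may differ.

```python
def estimate_bits(coeffs, steps):
    """Crude, hypothetical bit cost: one sign bit plus the magnitude bit length."""
    total = 0
    for c, s in zip(coeffs, steps):
        q = abs(round(c / s))        # quantized magnitude (an int)
        total += 1 + q.bit_length()
    return total

def interpolate_steps(coeffs, step_abs, step_mask, bit_budget, n_iter=16):
    """Move from absolute-threshold steps toward masking-threshold steps until the
    estimated bit count fits the budget. Returns (alpha, steps, bits), or None if
    even the masking-threshold steps do not fit (the rate-limited case)."""
    for i in range(n_iter + 1):
        alpha = i / n_iter
        steps = [(1 - alpha) * a + alpha * m for a, m in zip(step_abs, step_mask)]
        bits = estimate_bits(coeffs, steps)
        if bits <= bit_budget:
            return alpha, steps, bits
    return None

# Illustrative values only.
coeffs = [100.0, -40.0, 12.0, 3.0]
step_abs = [1.0, 1.0, 1.0, 1.0]      # steps implied by the absolute threshold
step_mask = [16.0, 8.0, 4.0, 2.0]    # steps implied by the masking threshold
result = interpolate_steps(coeffs, step_abs, step_mask, bit_budget=16)
```

When the function returns None the coder is rate-limited, which corresponds to the case in which a threshold elevation strategy would be needed; that strategy is not modelled here.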

In accordance with an overall coder/decoder system aspect of the present invention, it proves
advantageous to employ an improved Huffman-like entropy coder/decoder to further reduce the channel bit rate
requirements, or storage capacity for storage applications. The noiseless compression method illustratively used
employs Huffman coding along with a frequency-partitioning scheme to efficiently code the frequency samples
for L, R, M and S, as may be dictated by the perceptual threshold.

The present invention provides a mechanism for determining the scale factors to be used in quantizing
the audio signal (i.e., the MDCT coefficients output from the analysis filter bank) by using an approach different
from the prior art, while avoiding many of the restrictions and costs of prior quantizer/rate-loops. The audio
signals quantized pursuant to the present invention introduce less noise and encode into fewer bits than the
prior art.

These results are obtained in an illustrative embodiment of the present invention whereby the utilized scale
factor is iteratively derived by interpolating between a scale factor derived from a calculated threshold of
hearing at the frequency corresponding to the frequency of the respective spectral coefficient to be quantized and
a scale factor derived from the absolute threshold of hearing at said frequency, until the quantized spectral
coefficients can be encoded within permissible limits.

Brief Description of the Drawings

FIG. 1 presents an illustrative prior art audio communication/storage system of a type in which aspects of
the present invention find application, and provide improvement and extension.

FIG. 2 presents an illustrative perceptual audio coder (PAC) in which the advances and teachings of the
present invention find application, and provide improvement and extension.

FIG. 3 shows a representation of a useful masking level difference factor used in threshold calculations.

FIG. 4 presents an illustrative analysis filter bank according to an aspect of the present invention.

FIG. 5(a) through 5(e) illustrate the operation of various window functions.

FIG. 6 is a flow chart illustrating window switching functionality.

FIG. 7 is a block/flow diagram illustrating the overall processing of input signals to derive the output
bitstream.

FIG. 8 illustrates certain threshold variations.

FIG. 9 is a flowchart representation of certain bit allocation functionality.

FIG. 10 shows bitstream organization.

FIGs. 11a through 11c illustrate certain Huffman coding operations.

FIG. 12 shows operations at a decoder that are complementary to those for an encoder.

FIG. 13 is a flowchart illustrating certain quantization operations in accordance with an aspect of the
present invention.

FIG. 14(a) through 14(g) are illustrative windows for use with the filter bank of FIG. 4.

Detailed Description

1. Overview

To simplify the present disclosure, the following patents, patent applications and publications are hereby
incorporated by reference in the present disclosure as if fully set forth herein: U.S. Patent 5,040,217, issued
August 13, 1991 by K. Brandenburg et al.; United States Patent Application Serial No. 07/292,598, entitled
Perceptual Coding of Audio Signals, filed December 30, 1988; J.D. Johnston, Transform Coding of Audio Signals
Using Perceptual Noise Criteria, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2 (Feb. 1988);
International Patent Application (PCT) WO 88/01811, filed March 10, 1988; United States Patent Application
Serial No. 07/491,373, entitled Hybrid Perceptual Coding, filed March 9, 1990; Brandenburg et al., Aspec:
Adaptive Spectral Entropy Coding of High Quality Music Signals, AES 90th Convention (1991); Johnston, J.,
Estimation of Perceptual Entropy Using Noise Masking Criteria, ICASSP (1988); J.D. Johnston, Perceptual
Transform Coding of Wideband Stereo Signals, ICASSP (1989); E.F. Schroeder and J.J. Platte, "'MSC': Stereo Audio
Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer Electronics, Vol. CE-33, No. 4,
November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, IEEE
J.S.C.A. (Feb. 1988).

For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising
individual functional blocks (including functional blocks labeled as "processors"). The functions these blocks
represent may be provided through the use of either shared or dedicated hardware, including, but not limited
to, hardware capable of executing software. (Use of the term "processor" should not be construed to refer
exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal
processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, and software performing the operations
discussed below. Very large scale integration (VLSI) hardware embodiments of the present invention, as well
as hybrid DSP/VLSI embodiments, may also be provided.

FIG. 1 is an overall block diagram of a system useful for incorporating an illustrative embodiment of the
present invention. At the level shown, the system of FIG. 1 illustrates systems known in the prior art, but
modifications and extensions described herein will make clear the contributions of the present invention. In FIG.
1, an analog audio signal 101 is fed into a preprocessor 102 where it is sampled (typically at 48 kHz) and
converted into a digital pulse code modulation ("PCM") signal 103 (typically 16 bits) in standard fashion. The PCM
signal 103 is fed into a perceptual audio coder 104 ("PAC") which compresses the PCM signal and outputs the
compressed PAC signal to a communications channel/storage medium 105. From the communications
channel/storage medium the compressed PAC signal is fed into a perceptual audio decoder 107 which decompresses
the compressed PAC signal and outputs a PCM signal 108 which is representative of the compressed PAC
signal. From the perceptual audio decoder, the PCM signal 108 is fed into a post-processor 109 which creates
an analog representation of the PCM signal 108.
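By way of illustration only, the uncompressed bit rate implied by the typical sampling parameters mentioned above (48 kHz, 16 bits per sample) can be worked out as follows. The function name is an assumption made for this example; the figures are simple arithmetic and say nothing about the compressed rate achieved by the coder.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

mono_rate = pcm_bit_rate(48_000, 16)                 # one channel
stereo_rate = pcm_bit_rate(48_000, 16, channels=2)   # a stereo pair
```

At these settings a single channel requires 768 kbit/s and a stereo pair 1536 kbit/s before compression, which is the load that the perceptual audio coder 104 is intended to reduce.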

An illustrative embodiment of the perceptual audio coder 104 is shown in block diagram form in FIG. 2. As
in the case of the system illustrated in FIG. 1, the system of FIG. 2, without more, may equally describe certain
prior art systems, e.g., the system disclosed in the Brandenburg et al. U.S. Patent 5,040,217. However, with
the extensions and modifications described herein, important new results are obtained. The perceptual audio
coder of FIG. 2 may advantageously be viewed as comprising an analysis filter bank 202, a perceptual model
processor 204, a quantizer/rate-loop processor 206 and an entropy coder 208.

The filter bank 202 in FIG. 2 advantageously transforms an input audio signal in time/frequency in such
manner as to provide both some measure of signal processing gain (i.e., redundancy extraction) and a mapping
of the filter bank inputs in a way that is meaningful in light of the human perceptual system. Advantageously,
the well-known Modified Discrete Cosine Transform (MDCT) described, e.g., in J.P. Princen and A.B. Bradley,
"Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation," IEEE Trans. ASSP, Vol.
34, No. 5, October 1986, may be adapted to perform such transforming of the input signals.

Features of the MDCT that make it useful in the present context include its critical sampling characteristic,
i.e., for every n samples into the filter bank, n samples are obtained from the filter bank. Additionally, the MDCT
typically provides half-overlap, i.e., the transform length is exactly twice the length of the number of samples,
n, shifted into the filter bank. The half-overlap provides a good method of dealing with the control of noise
injected independently into each filter tap as well as providing a good analysis window frequency response. In
addition, in the absence of quantization, the MDCT provides exact reconstruction of the input samples, subject
only to a delay of an integral number of samples.
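By way of illustration only, the critical sampling, half-overlap and exact reconstruction properties can be checked numerically with a textbook MDCT. The sketch below uses the common direct-form MDCT definition with a sine window, which is an assumption made for this example; it is not the FFT-based computation or the window set of the illustrative embodiment, and the zero padding of N samples at each end is only a convenience for the demonstration.

```python
import math

def sine_window(N):
    """Length-2N sine window, which satisfies w[n]**2 + w[n+N]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, window):
    """2N windowed samples in, N coefficients out (critical sampling)."""
    N = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    """N coefficients in, 2N windowed, time-aliased samples out."""
    N = len(coeffs)
    return [(2.0 / N) * window[n]
            * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]

def analyze_synthesize(signal, N):
    """Run half-overlapped MDCT/IMDCT blocks and overlap-add the results.
    The signal length is assumed to be a multiple of N."""
    w = sine_window(N)
    padded = [0.0] * N + list(signal) + [0.0] * N
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):   # hop of N samples
        y = imdct(mdct(padded[start:start + 2 * N], w), w)
        for n in range(2 * N):
            out[start + n] += y[n]                       # aliasing cancels here
    return out[N:N + len(signal)]

signal = [math.sin(0.3 * i) + 0.1 * i for i in range(32)]
reconstructed = analyze_synthesize(signal, 8)
```

Each block of 2N samples advances by N samples and yields N coefficients, and the overlap-add of adjacent inverse transforms cancels the time-domain aliasing so that the input is recovered to within floating-point rounding.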

One aspect in which the MDCT is advantageously modified for use in connection with a highly efficient
stereophonic audio coder is the provision of the ability to switch the length of the analysis window for signal
sections which have strongly non-stationary components in such a fashion that it retains the critically sampled
and exact reconstruction properties. The incorporated U.S. patent application by Ferreira and Johnston,
entitled "A METHOD AND APPARATUS FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS," (referred to
hereinafter as the "filter bank application") filed of even date with this application, describes a filter bank
appropriate for performing the functions of element 202 in FIG. 2.

The perceptual model processor 204 shown in FIG. 2 calculates an estimate of the perceptual importance,
noise masking properties, or just noticeable noise floor of the various signal components in the analysis bank.
Signals representative of these quantities are then provided to other system elements to provide improved
control of the filtering operations and organizing of the data to be sent to the channel or storage medium. Rather
than using the critical band by critical band analysis described in J.D. Johnston, "Transform Coding of Audio
Signals Using Perceptual Noise Criteria," IEEE J. on Selected Areas in Communications, Feb. 1988, an
illustrative embodiment of the present invention advantageously uses finer frequency resolution in the calculation
of thresholds. Thus, instead of using an overall tonality metric as in the last-cited Johnston paper, a tonality
method based on that mentioned in K. Brandenburg and J.D. Johnston, "Second Generation Perceptual Audio
Coding: The Hybrid Coder," AES 89th Convention, 1990, provides a tonality estimate that varies over frequency,
thus providing a better fit for complex signals.

The psychoacoustic analysis performed in the perceptual model processor 204 provides a noise threshold
for the L (Left), R (Right), M (Sum) and S (Difference) channels, as may be appropriate, for both the normal
MDCT window and the shorter windows. Use of the shorter windows is advantageously controlled entirely by
the psychoacoustic model processor.

In operation, an illustrative embodiment of the perceptual model processor 204 evaluates thresholds for
the left and right channels, denoted THR_l and THR_r. The two thresholds are then compared in each of the
illustrative 35 coder frequency partitions (56 partitions in the case of an active window-switched block). In each
partition where the two thresholds vary between left and right by less than some amount, typically 2 dB, the
coder is switched into M/S mode. That is, the left signal for that band of frequencies is replaced by M = (L+R)/2,
and the right signal is replaced by S = (L-R)/2. The actual amount of difference that triggers the last-mentioned
substitution will vary with bitrate constraints and other system parameters.
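By way of illustration only, the per-partition switching decision and the M/S substitution may be sketched as follows. The sketch assumes that the thresholds are positive power-domain values, so that the left/right difference in dB is 10*log10 of their ratio; the function names, the mode labels and the sample values are assumptions made for this example.

```python
import math

def choose_partition_modes(thr_left, thr_right, max_diff_db=2.0):
    """Return "MS" for each partition whose L and R thresholds differ by less
    than max_diff_db, and "LR" otherwise."""
    modes = []
    for tl, tr in zip(thr_left, thr_right):
        diff_db = abs(10.0 * math.log10(tl / tr))
        modes.append("MS" if diff_db < max_diff_db else "LR")
    return modes

def to_mid_side(left, right):
    """Replace L by M = (L + R) / 2 and R by S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# Three hypothetical partitions: about 0.8 dB, 3 dB and 0 dB apart.
modes = choose_partition_modes([1.0, 1.0, 4.0], [1.2, 2.0, 4.0])
mid, side = to_mid_side([1.0, 0.5], [0.5, 0.5])
```

The original pair is recoverable as L = M + S and R = M - S, so the substitution itself loses no information.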

The same threshold calculation used for L and R thresholds is also used for M and S thresholds, with the
threshold calculated on the actual M and S signals. First, the basic thresholds, denoted BTHR_m and BTHR_s, are
calculated. Then, the following steps are used to calculate the stereo masking contribution of the M and S
signals.

1. An additional factor is calculated for each of the M and S thresholds. This factor, called MLD_m and MLD_s,
is calculated by multiplying the spread signal energy (as derived, e.g., in J.D. Johnston, "Transform Coding
of Audio Signals Using Perceptual Noise Criteria," IEEE J. on Selected Areas in Communications, Feb.
1988; K. Brandenburg and J.D. Johnston, "Second Generation Perceptual Audio Coding: The Hybrid
Coder," AES 89th Convention, 1990; and Brandenburg et al., U.S. Patent 5,040,217) by a masking level
difference factor shown illustratively in FIG. 3. This calculates a second level of detectability of noise across
frequency in the M and S channels, based on the masking level differences shown in various sources.
2. The actual threshold for M (THR_m) is calculated as THR_m = max(BTHR_m, min(BTHR_s, MLD_s)) and the
threshold for S is calculated as THR_s = max(BTHR_s, min(BTHR_m, MLD_m)).

In effect, the MLD signal substitutes for the BTHR signal in cases where there is a chance of stereo
unmasking. It is not necessary to consider the issue of M and S threshold depression due to unequal L and R
thresholds, because the L and R thresholds are known to be equal.
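By way of illustration only, the final M and S threshold computation of step 2 may be written for a single frequency partition as follows. The function and argument names are assumptions made for this example, and the numerical values are arbitrary.

```python
def stereo_thresholds(bthr_m, bthr_s, mld_m, mld_s):
    """Combine the basic thresholds (BTHR) with the masking-level-difference
    terms (MLD) to obtain the actual M and S thresholds for one partition."""
    thr_m = max(bthr_m, min(bthr_s, mld_s))
    thr_s = max(bthr_s, min(bthr_m, mld_m))
    return thr_m, thr_s

# Arbitrary example: the S channel has the larger basic threshold.
thr_m, thr_s = stereo_thresholds(bthr_m=1.0, bthr_s=4.0, mld_m=0.5, mld_s=2.0)
```

In this example the M threshold is raised from 1.0 to 2.0, limited by MLD_s rather than by the full S threshold of 4.0, while the S threshold stays at its basic value of 4.0.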

The quantizer and rate control processor 206 used in the illustrative coder of FIG. 2 takes the outputs from
the analysis bank and the perceptual model, allocates bits and noise, and controls other system parameters
so as to meet the required bit rate for the given application. In some example coders this may consist of nothing
more than quantization so that the just noticeable difference of the perceptual model is never exceeded, with
no (explicit) attention to bit rate; in some coders this may be a complex set of iteration loops that adjusts
distortion and bitrate in order to achieve a balance between bit rate and coding noise. A particularly useful
quantizer and rate control processor is described in the incorporated U.S. patent application by J.D. Johnston entitled
"RATE LOOP PROCESSOR FOR PERCEPTUAL ENCODER/DECODER," (hereinafter referred to as the "rate
loop application") filed of even date with the present application. Also desirably performed by the rate loop
processor 206, and described in the rate loop application, is the function of receiving information from the
quantized analyzed signal and any requisite side information, inserting synchronization and framing information.
Again, these same functions are broadly described in the incorporated Brandenburg et al. U.S. Patent
5,040,217.

Entropy coder 208 is used to achieve a further noiseless compression in cooperation with the rate control
processor 206. In particular, entropy coder 208, in accordance with another aspect of the present invention,
advantageously receives inputs including a quantized audio signal output from quantizer/rate-loop 206,
performs a lossless encoding on the quantized audio signal, and outputs a compressed audio signal to the
communications channel/storage medium 106.

Illustrative entropy coder 208 advantageously comprises a novel variation of the minimum-redundancy
Huffman coding technique to encode each quantized audio signal. The Huffman codes are described, e.g., in
D.A. Huffman, "A Method for the Construction of Minimum Redundancy Codes", Proc. IRE, 40:1098-1101
(1952) and T.M. Cover and J.A. Thomas, Elements of Information Theory, pp. 92-101 (1991). The useful
adaptations of the Huffman codes advantageously used in the context of the coder of FIG. 2 are described in
more detail in the incorporated U.S. patent application by J.D. Johnston and J. Reeds (hereinafter the "entropy
coder application") filed of even date with the present application and assigned to the assignee of this
application. Those skilled in the data communications arts will readily perceive how to implement alternative
embodiments of entropy coder 208 using other noiseless data compression techniques, including the well-known
Lempel-Ziv compression methods.
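By way of illustration only, the classical minimum-redundancy Huffman construction cited above may be sketched as follows. This is the textbook algorithm applied to a short, made-up sequence of quantizer indices; it is not the novel variation of the entropy coder application, and the function name and the tie-breaking order are assumptions made for this example.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) by repeatedly merging the two
    least frequent subtrees, as in the classical Huffman construction."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries are (count, unique tie-breaker, partial code table).
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c0, _, t0 = heapq.heappop(heap)
        c1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t0.items()}
        merged.update({s: "1" + code for s, code in t1.items()})
        heapq.heappush(heap, (c0 + c1, tie, merged))
        tie += 1
    return heap[0][2]

quantized = [0, 0, 0, 0, 1, 1, -1, 2]        # hypothetical quantizer indices
code = huffman_code(quantized)
encoded = "".join(code[q] for q in quantized)
```

Frequent indices receive short codewords and rare ones long codewords, so the encoded length here is 14 bits, compared with 16 bits for a fixed 2-bit code over the same four symbols.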

The use of each of the elements shown in FIG. 2 will be described in greater detail in the context of the
overall system functionality; details of operation will be provided for the perceptual model processor 204.

2.1. The Analysis Filter Bank 

The analysis filter bank 202 of the perceptual audio coder 104 receives as input pulse code modulated
("PCM") digital audio signals (typically 16-bit signals sampled at 48 kHz), and outputs a representation of the
input signal which identifies the individual frequency components of the input signal. Specifically, an output of
the analysis filter bank 202 comprises a Modified Discrete Cosine Transform ("MDCT") of the input signal. See
J. Princen et al., "Sub-band Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing
Cancellation," IEEE ICASSP, pp. 2161-2164 (1987).

An illustrative analysis filter bank 202 according to one aspect of the present invention is presented in FIG.
4. Analysis filter bank 202 comprises an input signal buffer 302, a window multiplier 304, a window memory
306, an FFT processor 308, an MDCT processor 310, a concatenator 311, a delay memory 312 and a data
selector 314.

The analysis filter bank 202 operates on frames. A frame is conveniently chosen as the 2N PCM input audio
signal samples held by input signal buffer 302. As stated above, each PCM input audio signal sample is
represented by M bits. Illustratively, N = 512 and M = 16.

Input signal buffer 302 comprises two sections: a first section comprising N samples in buffer locations 1
to N, and a second section comprising N samples in buffer locations N+1 to 2N. Each frame to be coded by
the perceptual audio coder 104 is defined by shifting N consecutive samples of the input audio signal into the
input signal buffer 302. Older samples are located at higher buffer locations than newer samples.

Assuming that at a given time, the input signal buffer 302 contains a frame of 2N audio signal samples,
the succeeding frame is obtained by (1) shifting the N audio signal samples in buffer locations 1 to N into buffer
locations N+1 to 2N, respectively (the previous audio signal samples in locations N+1 to 2N may be either
overwritten or deleted), and (2) by shifting into the input signal buffer 302, at buffer locations 1 to N, N new
audio signal samples from preprocessor 102. Therefore, it can be seen that consecutive frames contain N
samples in common: the first of the consecutive frames having the common samples in buffer locations 1 to N,
and the second of the consecutive frames having the common samples in buffer locations N+1 to 2N. Analysis
filter bank 202 is a critically sampled system (i.e., for every N audio signal samples received by the input signal
buffer 302, the analysis filter bank 202 outputs a vector of N scalars to the quantizer/rate-loop 206).
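By way of illustration only, the frame update of input signal buffer 302 may be sketched as follows. List index 0 stands for buffer location 1, and the new block is assumed to be supplied already arranged in buffer order; both conventions, the function name and the small value of N are assumptions made for this example.

```python
def shift_in_frame(buffer, new_block):
    """Return the next 2N-sample frame: the N new samples occupy locations
    1..N, and the former contents of locations 1..N move to N+1..2N."""
    n = len(new_block)
    if len(buffer) != 2 * n:
        raise ValueError("buffer must hold exactly 2N samples")
    return list(new_block) + buffer[:n]

N = 4                                   # small N for readability; the text uses N = 512
frame0 = [0] * (2 * N)
frame1 = shift_in_frame(frame0, [1, 2, 3, 4])
frame2 = shift_in_frame(frame1, [5, 6, 7, 8])
```

The N samples held in locations 1 to N of one frame reappear in locations N+1 to 2N of the next, which is the N-sample overlap between consecutive frames described above.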

Each frame of the input audio signal is provided to the window multiplier 304 by the input signal buffer
302 so that the window multiplier 304 may apply seven distinct data windows to the frame.

Each data window is a vector of scalars called "coefficients". While all seven of the data windows have 2N
coefficients (i.e., the same number as there are audio signal samples in the frame), four of the seven only
have N/2 non-zero coefficients (i.e., one-fourth the number of audio signal samples in the frame). As is
discussed below, the data window coefficients may be advantageously chosen to reduce the perceptual entropy
of the output of the MDCT processor 310.

The information for the data window coefficients is stored in the window memory 306. The window memory
306 may illustratively comprise a random access memory ("RAM"), read only memory ("ROM"), or other
magnetic or optical media. Drawings of seven illustrative data windows, as applied by window multiplier 304, are
presented in FIG. 14. Typical vectors of coefficients for each of the seven data windows presented in FIG. 14
are presented in Appendix A. As may be seen in both FIG. 14 and in Appendix A, some of the data window
coefficients may be equal to zero.

Keeping in mind that the data window is a vector of 2N scalars and that the audio signal frame is also a
vector of 2N scalars, the data window coefficients are applied to the audio signal frame scalars through
point-to-point multiplication (i.e., the first audio signal frame scalar is multiplied by the first data window coefficient,
the second audio signal frame scalar is multiplied by the second data window coefficient, etc.). Window
multiplier 304 may therefore comprise seven microprocessors operating in parallel, each performing 2N
multiplications in order to apply one of the seven data windows to the audio signal frame held by the input signal buffer
302. The output of the window multiplier 304 is seven vectors of 2N scalars to be referred to as "windowed
frame vectors".
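By way of illustration only, the point-to-point multiplication may be sketched as follows. The two short windows shown are made-up placeholders, since the coefficient vectors of Appendix A are not reproduced here, and the function name is an assumption made for this example.

```python
def apply_window(frame, window):
    """Multiply each frame sample by the window coefficient at the same index."""
    if len(frame) != len(window):
        raise ValueError("frame and window must both have 2N entries")
    return [x * w for x, w in zip(frame, window)]

frame = [1.0, 2.0, 3.0, 4.0]                 # 2N = 4 samples for readability
windows = [
    [0.5, 1.0, 1.0, 0.5],                    # placeholder "long" window
    [0.0, 0.5, 1.0, 0.0],                    # placeholder window with zero coefficients
]
windowed_frame_vectors = [apply_window(frame, w) for w in windows]
```

Each window produces one windowed frame vector of the same length as the frame; the windows are independent of one another, which is why the text contemplates applying them in parallel.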

The seven windowed frame vectors are provided by window multiplier 304 to FFT processor 308. The FFT
processor 308 performs an odd-frequency FFT on each of the seven windowed frame vectors. The
odd-frequency FFT is a Discrete Fourier Transform evaluated at frequencies:

\[ \frac{k f_H}{2N} \]

where k = 1, 3, 5, ..., 2N, and f_H equals one half the sampling rate. The illustrative FFT processor 308 may
comprise seven conventional decimation-in-time FFT processors operating in parallel, each operating on a different
windowed frame vector. An output of the FFT processor 308 is seven vectors of 2N complex elements, to be
referred to collectively as "FFT vectors".

FFT processor 308 provides the seven FFT vectors to both the perceptual model processor 204 and the
MDCT processor 310. The perceptual model processor 204 uses the FFT vectors to direct the operation of
the data selector 314 and the quantizer/rate-loop processor 206. Details regarding the operation of data
selector 314 and perceptual model processor 204 are presented below.

MDCT processor 310 performs an MDCT based on the real components of each of the seven FFT vectors
received from FFT processor 308. MDCT processor 310 may comprise seven microprocessors operating in
parallel. Each such microprocessor determines one of the seven "MDCT vectors" of N real scalars based on
one of the seven respective FFT vectors. For each FFT vector, F(k), the resulting MDCT vector, X(k), is formed
as follows:

$$X(k) = \mathrm{Re}[F(k)] \cos\left[\frac{\pi (2k+1)(1+N)}{4N}\right], \qquad 1 \le k \le N.$$

The procedure need run k only to N, not 2N, because of redundancy in the result. To wit, for N < k ≤ 2N:

$$X(k) = -X(2N - k).$$

MDCT processor 310 provides the seven MDCT vectors to concatenator 311 and delay memory 312.

As discussed above with reference to window multiplier 304, four of the seven data windows have N/2
non-zero coefficients (see FIGs. 4c-f). This means that four of the windowed frame vectors contain only N/2
non-zero values. Therefore, the non-zero values of these four vectors may be concatenated into a single
vector of length 2N by concatenator 311 upon output from MDCT processor 310. The resulting concatenation
of these vectors is handled as a single vector for subsequent purposes. Thus, delay memory 312 is
presented with four MDCT vectors, rather than seven.

Delay memory 312 receives the four MDCT vectors from MDCT processor 310 and concatenator 311 for
the purpose of providing temporary storage. Delay memory 312 provides a delay of one audio signal frame
(as defined by input signal buffer 302) on the flow of the four MDCT vectors through the filter bank 202. The
delay is provided by (i) storing the two most recent consecutive sets of MDCT vectors representing consecutive
audio signal frames and (ii) presenting as input to data selector 314 the older of the consecutive sets of vectors.
Delay memory 312 may comprise random access memory (RAM) of size:

$$M \times 2 \times 4 \times N$$

where 2 is the number of consecutive sets of vectors, 4 is the number of vectors in a set, N is the number of
elements in an MDCT vector, and M is the number of bits used to represent an MDCT vector element.
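The one-frame delay and the memory size can be sketched as follows. The class and function names are hypothetical, and a `deque` merely stands in for the RAM; the example values M = 16 and N = 512 are assumptions chosen for illustration.

```python
from collections import deque


def delay_memory_bits(bits_per_element, n):
    """RAM size in bits: M bits x 2 sets x 4 vectors x N elements."""
    return bits_per_element * 2 * 4 * n


class DelayMemory:
    """Keeps the two most recent sets of MDCT vectors and outputs the older one."""

    def __init__(self):
        self._sets = deque(maxlen=2)

    def push(self, vector_set):
        self._sets.append(vector_set)
        # Until two frames have arrived there is no delayed set to output.
        return self._sets[0] if len(self._sets) == 2 else None


memory = DelayMemory()
first = memory.push("frame 0 vectors")
second = memory.push("frame 1 vectors")
third = memory.push("frame 2 vectors")
size_bits = delay_memory_bits(16, 512)
```

After the first frame, each call returns the set stored one frame earlier, which is the behavior the data selector relies on.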

Data selector 314 selects one of the four MDCT vectors provided by delay memory 312 to be output from
the filter bank 202 to quantizer/rate-loop 206. As mentioned above, the perceptual model processor 204 directs
the operation of data selector 314 based on the FFT vectors provided by the FFT processor 308. Due to the
operation of delay memory 312, the seven FFT vectors provided to the perceptual model processor 204 and
the four MDCT vectors concurrently provided to data selector 314 are not based on the same audio input frame,
but rather on two consecutive input signal frames: the MDCT vectors are based on the earlier of the frames and
the FFT vectors on the later of the frames. Thus, the selection of a specific MDCT vector is based on
information contained in the next successive audio signal frame. The criteria according to which the perceptual
model processor 204 directs the selection of an MDCT vector are described in Section 2.2, below.

For purposes of an illustrative stereo embodiment, the above analysis filter bank 202 is provided for each
of the left and right channels.

2.2. The Perceptual Model Processor

A perceptual coder achieves success in reducing the number of bits required to accurately represent high
quality audio signals, in part, by introducing noise associated with quantization of information bearing signals,
such as the MDCT information from the filter bank 202. The goal is, of course, to introduce this noise in an
imperceptible or benign way. This noise shaping is primarily a frequency analysis instrument, so it is convenient
to convert a signal into a spectral representation (e.g., the MDCT vectors provided by filter bank 202), compute
the shape and amount of the noise that will be masked by these signals, and inject it by quantizing the
spectral values. These and other basic operations are represented in the structure of the perceptual coder shown
in FIG. 2.

The perceptual model processor 204 of the perceptual audio coder 104 illustratively receives its input from
the analysis filter bank 202, which operates on successive frames. The perceptual model processor inputs then
typically comprise seven Fast Fourier Transform (FFT) vectors from the analysis filter bank 202. These are
the outputs of the FFT processor 308 in the form of seven vectors of 2N complex elements, each corresponding
to one of the windowed frame vectors.

In order to mask the quantization noise by the signal, one must consider the spectral contents of the signal
and the duration of a particular spectral pattern of the signal. These two aspects are related to masking in the
frequency domain, where signal and noise are approximately steady state given the integration period of the
hearing system, and also to masking in the time domain, where signal and noise are subjected to different
cochlear filters. The shape and length of these filters are frequency dependent.

Masking in the frequency domain is described by the concept of simultaneous masking. Masking in the
time domain is characterized by the concepts of premasking and postmasking. These concepts are extensively
explained in the literature; see, for example, E. Zwicker and H. Fastl, "Psychoacoustics, Facts and Models,"
Springer-Verlag, 1990. To make these concepts useful to perceptual coding, they are embodied in different
ways.

Simultaneous masking is evaluated by using perceptual noise shaping models. Given the spectral contents
of the signal and its description in terms of noise-like or tone-like behavior, these models produce a
hypothetical masking threshold that rules the quantization level of each spectral component. This noise shaping
represents the maximum amount of noise that may be introduced in the original signal without causing any
perceptible difference. A measure called the PERCEPTUAL ENTROPY (PE) uses this hypothetical masking
threshold to estimate the theoretical lower bound of the bitrate for transparent encoding. See J. D. Johnston,
"Estimation of Perceptual Entropy Using Noise Masking Criteria," ICASSP, 1989.

Premasking characterizes the (in)audibility of a noise that starts some time before the masker signal, which
is louder than the noise. The noise amplitude must be more attenuated as the delay increases. This attenuation
level is also frequency dependent. If the noise is the quantization noise attenuated by the first half of the
synthesis window, experimental evidence indicates the maximum acceptable delay to be about 1 millisecond.

This problem is very sensitive and can conflict directly with achieving a good coding gain. Assuming
stationary conditions, which is a false premise, the coding gain is bigger for larger transforms, but the
quantization error spreads till the beginning of the reconstructed time segment. So, if a transform length of 1024
points is used, with a digital signal sampled at a rate of 48000 Hz, the noise will appear at most 21 milliseconds
before the signal. This scenario is particularly critical when the signal takes the form of a sharp transient in
the time domain commonly known as an "attack". In this case the quantization noise is audible before the
attack. The effect is known as pre-echo.
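The 21 millisecond figure follows directly from the transform length and the sampling rate. A small worked sketch (the function name is an assumption) also shows the duration of a 256-sample window at the same rate:

```python
def segment_duration_ms(window_length, sample_rate_hz):
    """Duration in milliseconds of a window of the given length in samples."""
    return 1000.0 * window_length / sample_rate_hz


long_ms = segment_duration_ms(1024, 48000)   # about 21.3 ms
short_ms = segment_duration_ms(256, 48000)   # about 5.3 ms
```

The long window exceeds the roughly 1 millisecond premasking tolerance by more than an order of magnitude, which is why a transient inside it can produce an audible pre-echo.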

Thus, a fixed length filter bank is neither a good perceptual solution nor a good signal processing solution for
non-stationary regions of the signal. It will be shown later that a possible way to circumvent this problem is to
improve the temporal resolution of the coder by reducing the analysis/synthesis window length. This is
implemented as a window switching mechanism when conditions of attack are detected. In this way, the coding gain
achieved by using a long analysis/synthesis window will be affected only when such detection occurs with a
consequent need to switch to a shorter analysis/synthesis window.

Postmasking characterizes the (in)audibility of a noise when it remains after the cessation of a stronger
masker signal. In this case the acceptable delays are in the order of 20 milliseconds. Given that the bigger
transformed time segment lasts 21 milliseconds (1024 samples), no special care is needed to handle this
situation.

WINDOW SWITCHING 

The PERCEPTUAL ENTROPY (PE) measure of a particular transform segment gives the theoretical lower
bound of bits/sample to code that segment transparently. Due to its memory properties, which are related to
premasking protection, this measure shows a significant increase of the PE value relative to its previous value
(that of the previous segment) when some situations of strong non-stationarity of the signal (e.g., an attack) are
presented. This important property is used to activate the window switching mechanism in order to reduce
pre-echo. This window switching mechanism is not a new strategy, having been used, e.g., in the ASPEC coder,
described in the ISO/MPEG Audio Coding Report, 1990, but the decision technique behind it is new, using the
PE information to accurately localize the non-stationarity and define the right moment to operate the switch.

Two basic window lengths are used: 1024 samples and 256 samples. The former corresponds to a
segment duration of about 21 milliseconds and the latter to a segment duration of about 5 milliseconds. Short
windows are associated in sets of 4 to represent as much spectral data as a large window (but they represent a
"different" number of temporal samples). In order to make the transition from large to short windows and
vice-versa, it proves convenient to use two more types of windows. A START window makes the transition from large
(regular) to short windows and a STOP window makes the opposite transition, as shown in FIG. 5b. See the
above-cited Princen reference for useful information on this subject. Both windows are 1024 samples wide.
They are useful to keep the system critically sampled and also to guarantee the time aliasing cancellation
process in the transition region.

In order to exploit interchannel redundancy and irrelevancy, the same type of window is used for RIGHT
and LEFT channels in each segment.

The stationarity behavior of the signal is monitored at two levels: first by large regular windows, then, if
necessary, by short windows. Accordingly, the PE of the large (regular) window is calculated for every segment,
while the PEs of short windows are calculated only when needed. However, the tonality information for both
types is updated for every segment in order to follow the continuous variation of the signal.

Unless stated otherwise, a segment involves 1024 samples which is the length of a large regular window. 

The diagram of FIG. 5a represents all the monitoring possibilities when the segment from the point N/2 till
the point 3N/2 is being analyzed. Related to this diagram, the flowchart of FIG. 6 describes the monitoring
sequence and decision technique. We need to keep in buffer three halves of a segment in order to be able to
insert a START window prior to a sequence of short windows when necessary. FIGs. 5a-e explicitly consider
the 50% overlap between successive segments.

The process begins by analysing a "new" segment with 512 new temporal samples (the remaining 512
samples belong to the previous segment). The PE of this new segment and the differential PE to the previous
segment are calculated. If the latter value reaches a predefined threshold, then the existence of a
non-stationarity inside the current segment is declared and details are obtained by processing four short windows with
positions as represented in FIG. 5a. The PE value of each short window is calculated, resulting in the ordered
sequence: PE1, PE2, PE3 and PE4. From these values, the exact beginning of the strong non-stationarity of
the signal is deduced. Only five locations are possible. They are identified in FIG. 4a as L1, L2, L3, L4 and L5.

As will become evident, if the non-stationarity had occurred somewhere from the point — till the point —,
that situation would have been detected in the previous segment. It follows that the PE1 value does not contain
relevant information about the stationarity of the current segment. The average PE of the short windows is
compared with the PE of the large window of the same segment. A smaller PE reveals a more efficient coding
situation. Thus if the former value is not smaller than the latter, then we assume that we are facing a degenerate
situation and the window switching process is aborted.

It has been observed that for short windows the information about stationarity lies more in its PE value
than in the differential to the PE value of the preceding window. Accordingly, the first window that has a PE
value larger than a predefined threshold is detected. PE2 is identified with location L1, PE3 with L2 and PE4
with location L3. In either case, a START window is placed before the current segment, which will be coded with
short windows. A STOP window is needed to complete the process. There are, however, two possibilities. If
the identified location where the strong non-stationarity of the signal begins is L1 or L2, then this is well inside
the short window sequence, no coding artifacts result, and the coding sequence is depicted in FIG. 5b. If the
location is L4, then, in the worst situation, the non-stationarity may begin very close to the right edge of the
last short window. Previous results have consistently shown that placing a STOP window, in coding conditions,
in these circumstances significantly degrades the reconstruction of the signal at this switching point. For this
reason, another set of four short windows is placed before a STOP window. The resulting coding sequence is
represented in FIG. 5e.

If none of the short PEs is above the threshold, the remaining possibilities are L4 or L5. In this case, the
problem lies ahead of the scope of the short window sequence and the first segment in the buffer may be
immediately coded using a regular large window.
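The detection logic described above can be sketched as follows. This is a simplified illustration, not the flowchart of FIG. 6: the function names, the return labels, and all threshold values are hypothetical, and the further disambiguation between L4 and L5 is left out.

```python
def needs_short_window_analysis(pe_now, pe_previous, diff_threshold):
    """True when the PE increase over the previous segment reaches the threshold."""
    return (pe_now - pe_previous) >= diff_threshold


def locate_nonstationarity(short_pes, large_pe, pe_threshold):
    """Classify a segment from its four short-window PEs (PE1..PE4).

    Returns "abort", "L1", "L2", "L3", or "L4 or L5".
    """
    if len(short_pes) != 4:
        raise ValueError("expected PE1..PE4")
    # Degenerate case: on average, short windows do not code more efficiently.
    if sum(short_pes) / 4.0 >= large_pe:
        return "abort"
    # PE1 carries no relevant information, so only PE2..PE4 are scanned in order.
    for location, pe in zip(("L1", "L2", "L3"), short_pes[1:]):
        if pe > pe_threshold:
            return location
    return "L4 or L5"


attack = locate_nonstationarity([1.0, 2.0, 9.0, 3.0], large_pe=10.0, pe_threshold=5.0)
quiet = locate_nonstationarity([1.0, 1.0, 1.0, 1.0], large_pe=10.0, pe_threshold=5.0)
degenerate = locate_nonstationarity([10.0, 10.0, 10.0, 10.0], large_pe=5.0, pe_threshold=5.0)
```

In the first call PE3 is the first value above the threshold, so the attack is placed at L2; the second call falls through to the L4 or L5 case.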

To identify the correct location, another short window must be processed. It is represented in FIG. 5a by
a dotted curve and its PE value, PE1_{n+1}, is also computed. As is easily recognized, this short window already
belongs to the next segment. If PE1_{n+1} is above the threshold, then the location is L4 and, as depicted in FIG.
5c, a START window may be followed by a STOP window. In this case the spread of the quantization noise
will be limited to the length of a short window, and a better coding gain is achieved. In the rare situation of the
location being L5, the coding is done according to the sequence of FIG. 5d. The way to prove that in this
case that is the right solution is by confirming that PE2_{n+1} will be above the threshold. PE2_{n+1} is the PE of the
short window (not represented in FIG. 5) immediately following the window identified with PE1_{n+1}.

As mentioned before, for each segment, RIGHT and LEFT channels use the same type of analysis/synthesis
window. This means that a switch is done for both channels when at least one channel requires it.

It has been observed that for low bitrate applications the solution of FIG. 5c, although representing a good
local psychoacoustic solution, demands an unreasonably large number of bits that may adversely affect the
coding quality of subsequent segments. For this reason, that coding solution may eventually be inhibited.

It is also evident that the details of the reconstructed signal when short windows are used are closer to
the original signal than when only regular large windows are used. This is so because the attack is basically a
wide bandwidth signal and may only be considered stationary for very short periods of time. Since short
windows have a greater temporal resolution than large windows, they are able to follow and reproduce with more
fidelity the varying pattern of the spectrum. In other words, this is the difference between a more precise local
(in time) quantization of the signal and a global (in frequency) quantization of the signal.

The final masking threshold of the stereophonic coder is calculated using a combination of monophonic
and stereophonic thresholds. While the monophonic threshold is computed independently for each channel,
the stereophonic one considers both channels.

The independent masking threshold for the RIGHT or the LEFT channel is computed using a
psychoacoustic model that includes an expression for tone masking noise and noise masking tone. The latter is used
as a conservative approximation for a noise masking noise expression. The monophonic threshold is calculated
using the same procedure as previous work. In particular, a tonality measure considers the evolution of the
power and the phase of each frequency coefficient across the last three segments to identify the signal as
being more tone-like or noise-like. Accordingly, each psychoacoustic expression is more or less weighted than
the other. These expressions found in the literature were updated for better performance. They are defined
as:

$$TMN_{dB} = 19.5 + bark \cdot \frac{18.0}{26.0}$$

$$NMT_{dB} = 6.56 - bark \cdot \frac{3.06}{26.0}$$

where bark is the frequency in Bark scale. This scale is related to what we may call the cochlear filters
or critical bands which, in turn, are identified with constant length segments of the basilar membrane. The final
threshold is adjusted to consider absolute thresholds of masking and also to consider a partial premasking
protection.

A brief description of the complete monophonic threshold calculation follows. Some terminology must be
introduced in order to simplify the description of the operations involved.

The spectrum of each segment is organized in three different ways, each one following a different purpose.

1. First, it may be organized in partitions. Each partition has associated one single Bark value. These
partitions provide a resolution of approximately either one MDCT line or 1/3 of a critical band, whichever is
wider. At low frequencies a single line of the MDCT will constitute a coder partition. At high frequencies,
many lines will be combined into one coder partition. In this case the Bark value associated is the median
Bark point of the partition. This partitioning of the spectrum is necessary to ensure an acceptable resolution
for the spreading function. As will be shown later, this function represents the masking influence among
neighboring critical bands.

2. Secondly, the spectrum may be organized in bands. Bands are defined by a parameter file. Each band
groups a number of spectral lines that are associated with a single scale factor that results from the final
masking threshold vector.

3. Finally, the spectrum may also be organized in sections. It will be shown later that sections involve an
integer number of bands and represent a region of the spectrum coded with the same Huffman code book.
Three indices for data values are used. These are:

ω indicates that the calculation is indexed by frequency in the MDCT line domain.
b indicates that the calculation is indexed in the threshold calculation partition domain. In the case
where we do a convolution or sum in that domain, bb will be used as the summation variable.
n indicates that the calculation is indexed in the coder band domain.

Additionally some symbols are also used:

1. The index of the calculation partition, b.

2. The lowest frequency line in the partition, ωlow_b.

3. The highest frequency line in the partition, ωhigh_b.

4. The median bark value of the partition, bval_b.

5. The value for tone masking noise (in dB) for the partition, TMN_b.

6. The value for noise masking tone (in dB) for the partition, NMT_b.

Several points in the following description refer to the "spreading function". It is calculated by the following
method:

$$tmpx = 1.05\,(j - i),$$

where i is the bark value of the signal being spread, j the bark value of the band being spread into, and tmpx
is a temporary variable.

$$x = 8\,\mathrm{minimum}\big((tmpx - .5)^2 - 2(tmpx - .5),\ 0\big)$$

where x is a temporary variable, and minimum(a, b) is a function returning the more negative of a or b.

$$tmpy = 15.811389 + 7.5\,(tmpx + .474) - 17.5\,\big(1. + (tmpx + .474)^2\big)^{.5}$$

where tmpy is another temporary variable.

$$\text{if } (tmpy < -100) \text{ then } \{sprdngf(i,j) = 0\} \text{ else } \{sprdngf(i,j) = 10^{(x + tmpy)/10}\}.$$
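The spreading function can be written directly from the three temporary variables above. The sketch below assumes the final exponent is (x + tmpy)/10, i.e. that x and tmpy are both dB quantities that are summed before conversion to a power ratio.

```python
import math


def sprdngf(i, j):
    """Spreading from bark value i (signal being spread) into bark value j."""
    tmpx = 1.05 * (j - i)
    x = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = 15.811389 + 7.5 * (tmpx + 0.474) - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2)
    if tmpy < -100.0:
        return 0.0
    return 10.0 ** ((x + tmpy) / 10.0)


same_band = sprdngf(5.0, 5.0)      # very close to 1
one_bark_up = sprdngf(5.0, 6.0)
two_bark_up = sprdngf(5.0, 7.0)
far_away = sprdngf(0.0, 25.0)      # tmpy below -100 dB, so exactly 0
```

The value is essentially unity when both bark values coincide and decays as the distance between them grows.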

Steps in Threshold Calculation

The following are the necessary steps for calculating the SMR_n used in the coder.

1. Concatenate 512 new samples of the input signal to form another 1024-sample segment. Please refer
to FIG. 5d.

2. Calculate the complex spectrum of the input signal using the O-FFT as described in 2.0 and using a
sine window.

3. Calculate a predicted r and φ.

The polar representation of the transform is calculated. r_ω and φ_ω represent the magnitude and phase
components of a spectral line of the transformed segment.

A predicted magnitude, r̂_ω, and phase, φ̂_ω, are calculated from the preceding two threshold calculation
blocks' r and φ:

$$\hat{r}_\omega = 2\,r_\omega(t-1) - r_\omega(t-2)$$

$$\hat{\phi}_\omega = 2\,\phi_\omega(t-1) - \phi_\omega(t-2)$$

where t represents the current block number, t-1 indexes the previous block's data, and t-2 indexes the data
from the threshold calculation block before that.

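4. Calculate the unpredictability measure c_ω.
c_ω, the unpredictability measure, is:

$$c_\omega = \frac{\big((r_\omega \cos\phi_\omega - \hat{r}_\omega \cos\hat{\phi}_\omega)^2 + (r_\omega \sin\phi_\omega - \hat{r}_\omega \sin\hat{\phi}_\omega)^2\big)^{.5}}{r_\omega + \mathrm{abs}(\hat{r}_\omega)}$$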
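Steps 3 and 4 can be sketched per spectral line as follows, assuming the unpredictability is the Euclidean distance between the actual and predicted line in the complex plane, divided by the sum of the actual magnitude and the absolute predicted magnitude. The function names and the handling of a zero denominator are assumptions.

```python
import math


def predict(prev1, prev2):
    """Linear extrapolation from the two preceding blocks: 2*v(t-1) - v(t-2)."""
    return 2.0 * prev1 - prev2


def unpredictability(r, phi, r_hat, phi_hat):
    """Distance between actual and predicted line, normalized by r + |r_hat|."""
    dx = r * math.cos(phi) - r_hat * math.cos(phi_hat)
    dy = r * math.sin(phi) - r_hat * math.sin(phi_hat)
    denom = r + abs(r_hat)
    if denom == 0.0:
        return 0.0  # assumption: a silent line is treated as fully predictable
    return math.hypot(dx, dy) / denom


# Steady tone: constant magnitude, phase advancing by 0.1 rad per block.
r_hat = predict(1.0, 1.0)
phi_hat = predict(0.2, 0.1)
c_tone = unpredictability(1.0, 0.3, r_hat, phi_hat)

# Worst case: predicted line points in the opposite direction.
c_opposite = unpredictability(1.0, math.pi, 1.0, 0.0)
```

A stationary sinusoid is predicted exactly and yields a value near 0, while noise-like lines yield values toward 1.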

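5. Calculate the energy and unpredictability in the threshold calculation partitions.
The energy in each partition, e_b, is:

$$e_b = \sum_{\omega = \omega low_b}^{\omega high_b} r_\omega^2$$

and the weighted unpredictability, c_b, is:

$$c_b = \sum_{\omega = \omega low_b}^{\omega high_b} r_\omega^2\, c_\omega$$
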
6. Convolve the partitioned energy and unpredictability with the spreading function.

$$ecb_b = \sum_{bb=1}^{bmax} e_{bb}\, sprdngf(bval_{bb}, bval_b)$$

$$ct_b = \sum_{bb=1}^{bmax} c_{bb}\, sprdngf(bval_{bb}, bval_b)$$

Because ct_b is weighted by the signal energy, it must be renormalized to cb_b:

$$cb_b = \frac{ct_b}{ecb_b}$$

At the same time, due to the non-normalized nature of the spreading function, ecb_b should be renormalized
and the normalized energy en_b calculated:

$$en_b = ecb_b \cdot rnorm_b$$

The normalization coefficient, rnorm_b, is:

$$rnorm_b = \frac{1}{\sum_{bb=1}^{bmax} sprdngf(bval_{bb}, bval_b)}$$
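Steps 5 and 6 can be sketched as below. The sketch assumes that the partition energy is the sum of squared line magnitudes, that the weighted unpredictability is the energy-weighted sum of the line unpredictabilities, and that the normalized energy is the spread energy divided by the sum of the spreading gains. The spreading function is passed in as a parameter; all names are hypothetical.

```python
def partition_sums(r, c, low, high):
    """Step 5: energy and energy-weighted unpredictability per partition.

    low[b] and high[b] are the inclusive line indices of partition b.
    """
    e = [sum(r[w] ** 2 for w in range(lo, hi + 1)) for lo, hi in zip(low, high)]
    cw = [sum(r[w] ** 2 * c[w] for w in range(lo, hi + 1)) for lo, hi in zip(low, high)]
    return e, cw


def spread_and_normalize(e, cw, bval, spread):
    """Step 6: convolve with the spreading function, then renormalize."""
    count = len(e)
    cb, en = [], []
    for b in range(count):
        gains = [spread(bval[bb], bval[b]) for bb in range(count)]
        ecb = sum(e[bb] * gains[bb] for bb in range(count))
        ct = sum(cw[bb] * gains[bb] for bb in range(count))
        rnorm = 1.0 / sum(gains)
        cb.append(ct / ecb if ecb > 0.0 else 0.0)
        en.append(ecb * rnorm)
    return cb, en


r = [1.0, 2.0, 3.0, 4.0]
c = [0.5, 0.5, 1.0, 1.0]
e, cw = partition_sums(r, c, low=[0, 2], high=[1, 3])

# Toy spreading functions: no spreading at all, and complete spreading.
cb_none, en_none = spread_and_normalize(e, cw, [1.0, 2.0], lambda i, j: 1.0 if i == j else 0.0)
cb_full, en_full = spread_and_normalize(e, cw, [1.0, 2.0], lambda i, j: 1.0)
```

With no spreading the partitions keep their own energy and unpredictability; with complete spreading both partitions receive the same averaged values.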



7. Convert cb_b to tb_b.

$$tb_b = -.299 - .43 \log_e(cb_b)$$

Each tb_b is limited to the range 0 ≤ tb_b ≤ 1.

8. Calculate the required SNR in each partition.

$$TMN_b = 19.5 + bval_b \cdot \frac{18.0}{26.0}$$

$$NMT_b = 6.56 - bval_b \cdot \frac{3.06}{26.0}$$

where TMN_b is the tone masking noise in dB and NMT_b is the noise masking tone value in dB.
The required signal to noise ratio, SNR_b, is:

$$SNR_b = tb_b\, TMN_b + (1 - tb_b)\, NMT_b$$

9. Calculate the power ratio.

The power ratio, bc_b, is:

$$bc_b = 10^{-SNR_b/10}$$

10. Calculation of actual energy threshold, nb_b.

$$nb_b = en_b\, bc_b$$

11. Spread the threshold energy over MDCT lines, yielding nb_ω.

$$nb_\omega = \frac{nb_b}{\omega high_b - \omega low_b + 1}$$

12. Include absolute thresholds, yielding the final energy threshold of audibility, thr_ω.

$$thr_\omega = \max(nb_\omega, absthr_\omega).$$

The dB values of absthr shown in the "Absolute Threshold Tables" are relative to the level that a sine wave
of ± 1/2 lsb has in the MDCT used for threshold calculation.

The dB values must be converted into the energy domain after considering the MDCT normalization
actually used.
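Steps 7 through 12 can be sketched for a single partition as follows. The slopes 18.0/26.0 and 3.06/26.0 and the power ratio 10^(-SNR/10) are assumptions exposed as parameters or noted in comments; the function names and the treatment of a zero unpredictability are hypothetical as well.

```python
import math


def tonality_index(cb):
    """Step 7: tb = -0.299 - 0.43 ln(cb), limited to the range 0..1."""
    if cb <= 0.0:
        return 1.0  # assumption: a perfectly predictable partition is fully tonal
    return min(1.0, max(0.0, -0.299 - 0.43 * math.log(cb)))


def required_snr_db(tb, bval, tmn_slope=18.0 / 26.0, nmt_slope=3.06 / 26.0):
    """Step 8: blend tone-masking-noise and noise-masking-tone values (dB).

    The default slopes are assumed values.
    """
    tmn = 19.5 + bval * tmn_slope
    nmt = 6.56 - bval * nmt_slope
    return tb * tmn + (1.0 - tb) * nmt


def line_thresholds(en, snr_db, low, high, absthr):
    """Steps 9-12 for one partition covering lines low..high inclusive."""
    bc = 10.0 ** (-snr_db / 10.0)      # step 9: power ratio (assumed form)
    nb = en * bc                       # step 10: energy threshold of the partition
    per_line = nb / (high - low + 1)   # step 11: spread over the MDCT lines
    return [max(per_line, absthr[w]) for w in range(low, high + 1)]  # step 12


noise_like = tonality_index(1.0)
tone_like = tonality_index(0.01)
thr = line_thresholds(en=100.0, snr_db=10.0, low=0, high=1, absthr=[1.0, 10.0])
```

In the example the partition threshold of 10 is split evenly over two lines, and the second line is then raised to its absolute threshold.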

13. Pre-echo control.

14. Calculate the signal to mask ratios, SMR_n.
The table of "Bands of the Coder" shows:

1. The index, n, of the band.

2. The upper index, ωhigh_n, of the band n. The lower index, ωlow_n, is computed from the previous band as
ωhigh_{n-1} + 1.

To further classify each band, another variable is created. The width index, width_n, will assume a value
width_n = 1 if n is a perceptually narrow band, and width_n = 0 if n is a perceptually wide band. The former case
occurs if

$$bval_{\omega high_n} - bval_{\omega low_n} < bandlength$$

bandlength is a parameter set in the initialization routine. Otherwise the latter case is assumed.
Then, if (width_n = 1), the noise level in the coder band, nband_n, is calculated as:

$$nband_n = \frac{\sum_{\omega = \omega low_n}^{\omega high_n} thr_\omega}{\omega high_n - \omega low_n + 1}$$

else,

$$nband_n = \mathrm{minimum}(thr_{\omega low_n}, \ldots, thr_{\omega high_n})$$

where, in this case, minimum(a, ..., z) is a function returning the most negative or smallest positive
argument of the arguments a...z.
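The band noise level can be sketched as follows, assuming that a perceptually narrow band uses the average of its line thresholds and a perceptually wide band uses the smallest one. The function name and the toy threshold values are hypothetical.

```python
def band_noise_level(thr, low, high, narrow):
    """Noise level of a coder band covering lines low..high inclusive."""
    lines = thr[low:high + 1]
    if narrow:
        # Perceptually narrow band: average threshold over the band.
        return sum(lines) / (high - low + 1)
    # Perceptually wide band: the most demanding (smallest) line threshold.
    return min(lines)


thr = [4.0, 2.0, 6.0, 8.0]
narrow_level = band_noise_level(thr, 0, 1, narrow=True)
wide_level = band_noise_level(thr, 2, 3, narrow=False)
```

Taking the minimum in a wide band is the conservative choice, since a single scale factor must then protect the most sensitive line in the band.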
The ratios to be sent to the decoder, SMR_n, are calculated as:

$$SMR_n = 10 \cdot \log_{10}\left(\frac{[12.0 \cdot nband_n]^{0.5}}{\mathrm{minimum}(absthr)}\right)$$

It is important to emphasize that since the tonality measure is the output of a spectrum analysis process,
the analysis window has a sine form for all the cases of large or short segments. In particular, when a segment
is chosen to be coded as a START or STOP window, its tonality information is obtained considering a sine
window; the remaining operations, e.g., the threshold calculation and the quantization of the coefficients, consider
the spectrum obtained with the appropriate window.

STEREOPHONIC THRESHOLD 

The stereophonic threshold has several goals. It is known that most of the time the two channels sound
"alike". Thus, some correlation exists that may be converted into coding gain. Looking into the temporal
representation of the two channels, this correlation is not obvious. However, the spectral representation has a
number of interesting features that may advantageously be exploited. In fact, a very practical and useful possibility
is to create a new basis to represent the two channels. This basis involves two orthogonal vectors, the vector
SUM and the vector DIFFERENCE, defined by the following linear combination:

$$\begin{bmatrix} SUM \\ DIF \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} RIGHT \\ LEFT \end{bmatrix}$$


These vectors, which have the length of the window being used, are generated in the frequency domain
since the transform process is by definition a linear operation. This has the advantage of simplifying the
computational load.
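The change of basis and its inverse can be sketched as follows, with SUM = (RIGHT + LEFT)/2 and DIF = (RIGHT - LEFT)/2 applied coefficient by coefficient. The function names and the toy coefficient values are hypothetical.

```python
def to_sum_dif(right, left):
    """Map RIGHT/LEFT spectral coefficients to SUM/DIFFERENCE."""
    s = [(r + l) / 2.0 for r, l in zip(right, left)]
    d = [(r - l) / 2.0 for r, l in zip(right, left)]
    return s, d


def to_right_left(s, d):
    """Inverse mapping: RIGHT = SUM + DIF, LEFT = SUM - DIF."""
    right = [a + b for a, b in zip(s, d)]
    left = [a - b for a, b in zip(s, d)]
    return right, left


right = [1.0, 0.5, -2.0]
left = [1.0, -0.5, 0.0]
s, d = to_sum_dif(right, left)
right_back, left_back = to_right_left(s, d)
```

When both channels carry the same coefficient, as in the first element, all of the energy lands in SUM and DIF is zero, which is the source of the coding gain.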

The first goal is to have a more decorrelated representation of the two signals. The concentration of most
of the energy in one of these new channels is a consequence of the redundancy that exists between RIGHT
and LEFT channels and, on average, always leads to a coding gain.

A second goal is to correlate the quantization noise of the RIGHT and LEFT channels and control the
localization of the noise or the unmasking effect. This problem arises if RIGHT and LEFT channels are quantized
and coded independently. This concept is exemplified by the following context: supposing that the threshold
of masking for a particular signal has been calculated, two situations may be created. First, we add to the signal
an amount of noise that corresponds to the threshold. If we present this same signal with this same noise to
the two ears, then the noise is masked. However, if we add an amount of noise that corresponds to the
threshold to the signal and present this combination to one ear, and do the same operation for the other ear but with
noise uncorrelated with the previous one, then the noise is not masked. In order to achieve masking again,
the noise at both ears must be reduced by a level given by the masking level differences (MLD).

The unmasking problem may be generalized to the following form: the quantization noise is not masked
if it does not follow the localization of the masking signal. Hence, in particular, we may have two limit cases:
center localization of the signal with unmasking more noticeable on the sides of the listener, and side
localization of the signal with unmasking more noticeable on the center line.

The new vectors SUM and DIFFERENCE are very convenient because they express the signal localized
on the center and also on both sides of the listener. Also, they make it possible to control the quantization noise
with center and side image. Thus, the unmasking problem is solved by controlling the protection level for the MLD
through these vectors. Based on some psychoacoustic information and other experiments and results, the
MLD protection is particularly critical from very low frequencies to about 3 kHz. It appears to depend only on the
signal power and not on its tonality properties. The following expression for the MLD proved to give good
results:

$$MLD_{dB}(i) = 25.5 \left[\cos\frac{\pi\, b(i)}{32.0}\right]^2$$

where i is the partition index of the spectrum (see [7]), and b(i) is the bark frequency of the center of the partition
i. This expression is only valid for b(i) ≤ 16.0, i.e., for frequencies below 3 kHz. The expression for the MLD
threshold is given by:

$$THR_{MLD}(i) = C(i)\, 10^{-MLD_{dB}(i)/10}$$

C(i) is the spread signal energy on the basilar membrane, corresponding only to the partition i.
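A sketch of the MLD protection is given below. It assumes a cosine-squared form, MLD_dB(i) = 25.5 [cos(pi b(i) / 32)]^2, which falls to zero at b(i) = 16 in agreement with the stated range of validity. Returning 0 dB above 16 bark is a further assumption, as are the function names.

```python
import math


def mld_db(bark):
    """Masking level difference in dB for a partition centered at `bark`."""
    if bark > 16.0:
        return 0.0  # assumption: no MLD protection outside the valid range
    return 25.5 * math.cos(math.pi * bark / 32.0) ** 2


def mld_threshold(spread_energy, bark):
    """THR_MLD = C * 10^(-MLD_dB / 10) for spread signal energy C."""
    return spread_energy * 10.0 ** (-mld_db(bark) / 10.0)


low_freq_protection = mld_db(0.0)          # 25.5 dB at the lowest frequencies
edge_protection = mld_db(16.0)             # essentially 0 dB at 16 bark
low_freq_threshold = mld_threshold(1.0, 0.0)
```

At low bark values the threshold sits far below the spread energy, and the protection relaxes smoothly toward 16 bark.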
A third and last goal is to take advantage of a particular stereophonic signal image to extract irrelevance
from directions of the signal that are masked by that image. In principle, this is done only when the stereo
image is strongly defined in one direction, in order to not compromise the richness of the stereo signal. Based
on the vectors SUM and DIFFERENCE, this goal is implemented by postulating the following two dual
principles:

1. If there is a strong depression of the signal (and hence of the noise) on both sides of the listener, then
an increase of the noise on the middle line (center image) is perceptually tolerated. The upper bound is
the side noise.

2. If there is a strong localization of the signal (and hence of the noise) on the middle line, then an increase
of the (correlated) noise on both sides is perceptually tolerated. The upper bound is the center noise.

However, any increase of the noise level must be corrected by the MLD threshold.
According to these goals, the final stereophonic threshold is computed as follows. First, the thresholds
for channels SUM and DIFFERENCE are calculated using the monophonic models for noise-masking-tone
and tone-masking-noise. The procedure is exactly the one presented in 3.2 till step 10. At this point we have
the actual energy threshold per band, nb_b, for both channels. By convenience, we call them THRn_SUM and
THRn_DIF, respectively, for the channel SUM and the channel DIFFERENCE.

Secondly, the MLD thresholds for both channels, i.e., THRn_{MLD,SUM} and THRn_{MLD,DIF}, are also calculated by:

$$THRn_{MLD,SUM} = en_{b,SUM}\, 10^{-MLDn_{dB}/10}$$

$$THRn_{MLD,DIF} = en_{b,DIF}\, 10^{-MLDn_{dB}/10}$$

The MLD protection and the stereo irrelevance are considered by computing:

$$nthr_{SUM} = MAX[THRn_{SUM},\ MIN(THRn_{DIF},\ THRn_{MLD,DIF})]$$

$$nthr_{DIF} = MAX[THRn_{DIF},\ MIN(THRn_{SUM},\ THRn_{MLD,SUM})]$$
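The MAX/MIN combination can be sketched for one band as follows. The function name is hypothetical, and the MLD thresholds are assumed to be the normalized energies attenuated by 10^(-MLD_dB/10).

```python
def stereo_thresholds(thr_sum, thr_dif, en_sum, en_dif, mld_db):
    """Combine monophonic and MLD thresholds for the SUM and DIFFERENCE channels."""
    attenuation = 10.0 ** (-mld_db / 10.0)
    thr_mld_sum = en_sum * attenuation
    thr_mld_dif = en_dif * attenuation
    # Each channel may take over the other channel's noise level,
    # but never more than the MLD threshold allows.
    nthr_sum = max(thr_sum, min(thr_dif, thr_mld_dif))
    nthr_dif = max(thr_dif, min(thr_sum, thr_mld_sum))
    return nthr_sum, nthr_dif


# Strong center image: SUM carries far more energy than DIFFERENCE.
nthr_sum, nthr_dif = stereo_thresholds(
    thr_sum=10.0, thr_dif=1.0, en_sum=500.0, en_dif=10.0, mld_db=20.0
)
```

Here the DIFFERENCE threshold is raised from 1 toward the center noise of 10, but the MLD threshold of the SUM channel caps it at about 5; the SUM threshold is unchanged.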
After these operations, the remaining steps after the 11th, as presented in 3.2, are also taken for both
channels. In essence, these last thresholds are further adjusted to consider the absolute threshold and also a partial
premasking protection. It must be noticed that this premasking protection was simply adopted from the
monophonic case. It considers a monaural time resolution of about 2 milliseconds. However, the binaural time
resolution is as accurate as 6 microseconds! How to conveniently code stereo signals with a relevant stereo image
based on interchannel time differences is a subject that needs further investigation.



STEREOPHONIC CODER 



The simplified structure of the stereophonic coder is presented in FIG. 12. For each segment of data being
analyzed, detailed information about the independent and relative behavior of both signal channels may be
available through the information given by large and short transforms. This information is used according to
the necessary number of steps needed to code a particular segment. These steps involve essentially the
selection of the analysis window, the definition on a band basis of the coding mode (R/L or S/D), the quantization
and Huffman coding of the coefficients and scale factors and, finally, the bitstream composing.



Coding Mode Selection 

When a new segment is read, the tonality updating for large and short analysis windows is done.
Monophonic thresholds and the PE values are calculated according to the technique described in Section 3.1. This
gives the first decision about the type of window to be used for both channels.

Once the window sequence is chosen, an orthogonal coding decision is then considered. It involves the
choice between independent coding of the channels, mode RIGHT/LEFT (R/L), or joint coding using the SUM
and DIFFERENCE channels (S/D). This decision is taken on a band basis of the coder. This is based on the
assumption that the binaural perception is a function of the output of the same critical bands at the two ears.
If the threshold at the two channels is very different, then there is no need for MLD protection and the signals
will not be more decorrelated if the channels SUM and DIFFERENCE are considered. If the signals are such
that they generate a stereo image, then an MLD protection must be activated and additional gains may be
exploited by choosing the S/D coding mode. A convenient way to detect this latter situation is by comparing the
monophonic threshold between RIGHT and LEFT channels. If the thresholds in a particular band do not differ
by more than a predefined value, e.g. 2 dB, then the S/D coding mode is chosen. Otherwise the independent
mode R/L is assumed. Associated with each band is a one-bit flag that specifies the coding mode of that
band and that must be transmitted to the decoder as side information. From now on it is called a coding mode flag.

The coding mode decision is adaptive in time since for the same band it may differ for subsequent
segments, and is also adaptive in frequency since for the same segment, the coding mode for subsequent bands
may be different. An illustration of a coding decision is given in FIG. 13. This illustration is valid for long and
also short segments.
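The per-band choice between the two coding modes can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the thresholds are assumed to be positive energies (so the difference in dB is 10·log10 of their ratio), and the 2 dB default is the example value quoted in the text.

```python
import math

def coding_mode_flags(thr_right, thr_left, max_diff_db=2.0):
    """Return one coding mode per band: "S/D" (joint) or "R/L" (independent).

    thr_right, thr_left: per-band monophonic energy thresholds (positive).
    max_diff_db: largest threshold difference, in dB, for which S/D is chosen.
    """
    flags = []
    for r, l in zip(thr_right, thr_left):
        diff_db = abs(10.0 * math.log10(r / l))  # energy ratio expressed in dB
        flags.append("S/D" if diff_db <= max_diff_db else "R/L")
    return flags
```

Calling the function again for each new segment, and for each band within it, gives the adaptivity in time and in frequency described above.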

At this point it is clear that since the window switching mechanism involves only monophonic measures,
the maximum number of PE measures per segment is 10 (2 channels × (1 large window + 4 short windows)).
However, the maximum number of thresholds that we may need to compute per segment is 20 and therefore
20 tonality measures must always be updated per segment (4 channels × (1 large window + 4 short windows)).



Bitrate Adjustment

It was previously said that the decisions for window switching and for coding mode selection are orthogonal
in the sense that they do not depend on each other. Also independent of these decisions is the final step of
the coding process, which involves quantization, Huffman coding and bitstream composing; i.e. there is no
feedback path. This has the advantage of reducing the whole coding delay to a minimum value (1024/48000
≈ 21.3 milliseconds) and also of avoiding instabilities due to unorthodox coding situations.

The quantization process affects both spectral coefficients and scale factors. Spectral coefficients are
clustered in bands, each band having the same step size or scale factor. Each step size is directly computed
from the masking threshold corresponding to its band, as seen in 3.2, step 14. The quantized values, which
are integer numbers, are then converted to variable word length or Huffman codes. The total number of bits
to code the segment, considering additional fields of the bitstream, is computed. Since the bitrate must be kept
constant, the quantization process must be iteratively done till that number of bits is within predefined limits.
After the number of bits needed to code the whole segment, considering the basic masking threshold, is computed,
the degree of adjustment is dictated by a buffer control unit. This control unit shares the deficit or credit of additional
bits among several segments, according to the needs of each one.

The technique of the bitrate adjustment routine is represented by the flowchart of FIG. 9. It may be seen
that after the total number of available bits to be used by the current segment is computed, an iterative
procedure tries to find a factor α such that if all the initial thresholds are multiplied by this factor the final total
number of bits is smaller than, and within an error δ of, the available number of bits. Even if the approximation
curve is so hostile that α is not found within the maximum number of iterations, one acceptable solution is
always available.

The main steps of this routine are as follows. First, an interval including the solution is found. Then a loop
seeks to rapidly converge to the solution. At each iteration, the best solution is updated.
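The two steps named above (find an interval, then converge while keeping the best solution) can be sketched as a bracketing phase followed by bisection. This is a hedged illustration, not the routine of FIG. 9: `count_bits` is a hypothetical callable returning the total bits needed when all thresholds are multiplied by a given factor, it is assumed to be non-increasing in that factor, and the doubling/halving bracket and the `max_iter` limit are choices made for the example. The sketch returns `None` if no factor within the budget was ever tried; the source does not say how its always-available solution is obtained.

```python
def adjust_thresholds(count_bits, available_bits, tolerance, max_iter=32):
    """Search for a threshold factor whose bit count fits the budget.

    Returns the best (factor, bits) pair found with bits <= available_bits,
    preferring the pair that leaves the fewest bits unused.
    """
    best = None  # best feasible (factor, bits) seen so far

    def consider(factor):
        nonlocal best
        bits = count_bits(factor)
        if bits <= available_bits and (best is None or bits > best[1]):
            best = (factor, bits)
        return bits

    def good_enough():
        return best is not None and available_bits - best[1] <= tolerance

    # Step 1: find an interval [lo, hi] where lo needs too many bits and hi fits.
    lo = hi = 1.0
    if consider(1.0) > available_bits:
        for _ in range(max_iter):        # too many bits: raise the thresholds
            lo, hi = hi, hi * 2.0
            if consider(hi) <= available_bits:
                break
    else:
        for _ in range(max_iter):        # bits to spare: lower the thresholds
            if good_enough():
                break
            lo, hi = lo / 2.0, lo
            if consider(lo) > available_bits:
                break

    # Step 2: bisect the interval, updating the best solution at each iteration.
    for _ in range(max_iter):
        if good_enough():
            break
        mid = 0.5 * (lo + hi)
        if consider(mid) > available_bits:
            lo = mid
        else:
            hi = mid
    return best
```

With the toy model `count_bits = lambda f: int(1000 / f)`, a budget of 400 bits and a tolerance of 10 bits, the search settles on a factor above 1 whose bit count lies between 390 and 400.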

In order to use the same procedure for segments coded with large and short windows, in this latter case
the coefficients of the 4 short windows are clustered by concatenating homologue bands. Scale factors are
clustered in the same way.

The bitrate adjustment routine calls another routine that computes the total number of bits to represent all
the Huffman coded words (coefficients and scale factors). This latter routine does a spectrum partitioning
according to the amplitude distribution of the coefficients. The goal is to assign predefined Huffman code books
to sections of the spectrum. Each section groups a variable number of bands and its coefficients are Huffman
coded with a convenient book. The limits of the section and the reference of the code book must be sent to
the decoder as side information.

The spectrum partitioning is done using a minimum cost strategy. The main steps are as follows. First, all
possible sections are defined (the limit is one section per band), each one having the code book that best
matches the amplitude distribution of the coefficients within that section. As the beginning and the end of the whole
spectrum are known, if K is the number of sections, there are K-1 separators between sections. The price to
eliminate each separator is computed. The separator that has the lowest price is eliminated (initial prices may be
negative). Prices are computed again before the next iteration. This process is repeated till a maximum
allowable number of sections is obtained and the smallest price to eliminate another separator is higher than a
predefined value.
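The minimum cost strategy can be sketched as a greedy merge of neighbouring sections. Everything numeric here is an assumption made for illustration: the code books in `BOOKS`, the fixed `SIDE_INFO_BITS` per section, and the flat bits-per-coefficient cost model are hypothetical stand-ins for real Huffman tables. Bands are assumed non-empty with integer coefficients whose magnitude does not exceed 15.

```python
# Hypothetical code books: (largest amplitude representable, bits per coefficient).
BOOKS = [(0, 0), (1, 2), (3, 4), (15, 7)]
SIDE_INFO_BITS = 9  # assumed cost of sending one section's limits and book index

def section_cost(bands):
    """Bits for one section (a list of bands) using its cheapest usable book."""
    coefs = [abs(c) for band in bands for c in band]
    peak = max(coefs)
    bits_per_coef = min(b for max_amp, b in BOOKS if max_amp >= peak)
    return SIDE_INFO_BITS + len(coefs) * bits_per_coef

def partition_spectrum(bands, max_sections, price_limit=0):
    """Greedily remove the cheapest separator until both stop conditions hold."""
    sections = [[band] for band in bands]  # start with one section per band
    while len(sections) > 1:
        # Price of removing the separator between sections i and i + 1:
        # bits of the merged section minus bits of the two separate sections.
        prices = [
            section_cost(sections[i] + sections[i + 1])
            - section_cost(sections[i]) - section_cost(sections[i + 1])
            for i in range(len(sections) - 1)
        ]
        i = min(range(len(prices)), key=prices.__getitem__)
        # Stop once few enough sections remain AND no cheap merge is left.
        if len(sections) <= max_sections and prices[i] > price_limit:
            break
        sections[i:i + 2] = [sections[i] + sections[i + 1]]
    return sections
```

Prices are recomputed from scratch on every pass, which mirrors the description above at the expense of speed; an implementation concerned with efficiency would update only the prices next to the removed separator.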

Aspects of the processing accomplished by quantizer/rate-loop 206 in FIG. 2 will now be presented. In the
prior art, rate-loop mechanisms have contained assumptions related to the monophonic case. With the shift
from monophonic to stereophonic perceptual coders, the demands placed upon the rate-loop are increased.

The inputs to quantizer/rate-loop 206 in FIG. 2 comprise spectral coefficients (i.e., the MDCT coefficients)
derived by analysis filter bank 202, and outputs of perceptual model 204, including calculated thresholds
corresponding to the spectral coefficients.

Quantizer/rate-loop 206 quantizes the spectral information based, in part, on the calculated thresholds
and the absolute thresholds of hearing and in doing so provides a bitstream to entropy coder 208. The bitstream
includes signals divided into three parts: (1) a first part containing the standardized side information; (2) a
second part containing the scaling factors for the 35 or 56 bands and additional side information used for so-called
adaptive-window switching, when used (the length of this part can vary depending on information in the first
part); and (3) a third part comprising the quantized spectral coefficients.

A "utilized scale factor", $\Delta$, is iteratively derived by interpolating between a calculated scale factor and a
scale factor derived from the absolute threshold of hearing at the frequency corresponding to the frequency
of the respective spectral coefficient to be quantized until the quantized spectral coefficients can be encoded
within permissible limits.

An illustrative embodiment of the present invention can be seen in FIG. W. As shown at W01, quantizer/rate-loop
receives a spectral coefficient, $C_f$, and an energy threshold, $E$, corresponding to that spectral coefficient.
A "threshold scale factor", $\Delta_0$, is calculated by

$$\Delta_0 = \sqrt{12E}$$

An "absolute scale factor", $\Delta_A$, is also calculated based upon the absolute threshold of hearing (i.e., the quietest
sound that can be heard at the frequency corresponding to the scale factor). Advantageously, an interpolation
constant, $\alpha$, and interpolation bounds $\alpha_{high}$ and $\alpha_{low}$ are initialized to aid in the adjustment of the utilized scale
factor:

$$\alpha_{high} = 1$$






Next, as shown in W05, the utilized scale factor is determined from:

$$\Delta = \Delta_0^{\alpha} \times \Delta_A^{(1-\alpha)}$$

Next, as shown in W07, the utilized scale factor is itself quantized because the utilized scale factor as
computed above is not discrete but is advantageously discrete when transmitted and used.

$$\Delta = Q^{-1}(Q(\Delta))$$

Next, as shown in W09, the spectral coefficient is quantized using the utilized scale factor to create a
"quantized spectral coefficient" $Q(C_f, \Delta)$:

$$Q(C_f, \Delta) = \mathrm{NINT}\left(\frac{C_f}{\Delta}\right)$$

where "NINT" is the nearest integer function. Because quantizer/rate-loop 206 must transmit both the quantized
spectral coefficient and the utilized scale factor, a cost, $C$, is calculated which is associated with how many
bits it will take to transmit them both. As shown in W11,

$$C = \mathrm{FOO}(Q(C_f, \Delta), Q(\Delta))$$

where FOO is a function which, depending on the specific embodiment, can be easily determined by persons
having ordinary skill in the art of data communications. As shown in W13, the cost, $C$, is tested to determine
whether it is in a permissible range PR. When the cost is within the permissible range, $Q(C_f, \Delta)$ and $Q(\Delta)$ are
transmitted to entropy coder 208.

Advantageously, and depending on the relationship of the cost $C$ to the permissible range PR, the
interpolation constant and bounds are adjusted until the utilized scale factor yields a quantized spectral coefficient
which has a cost within the permissible range. Illustratively, as shown in FIG. W at W13, the interpolation
bounds are manipulated to produce a binary search. Specifically,

when $C > PR$, $\alpha_{high} = \alpha$;

alternately,

when $C < PR$, $\alpha_{low} = \alpha$.

In either case, the interpolation constant is calculated by:

$$\alpha = \frac{\alpha_{low} + \alpha_{high}}{2}$$

The process then continues at W05 iteratively until $C$ comes within the permissible range PR.
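The loop through W05 to W13 can be sketched as below for a single coefficient. Several pieces are assumptions introduced only to make the example runnable: the logarithmic scale-factor quantizer (`quantize_scale`, `dequantize_scale`), the cost function `toy_cost` standing in for FOO, the lower bound starting at 0, the interpolation constant starting at its upper bound, the treatment of PR as a closed interval, and the iteration limit. The update rule as quoted only converges when lowering the interpolation constant lowers the cost, so the sample values used further down take the absolute scale factor to be larger than the threshold scale factor.

```python
import math

def nint(x):
    """Nearest-integer function (halves round away from zero)."""
    return int(math.floor(abs(x) + 0.5)) * (1 if x >= 0 else -1)

def quantize_scale(delta):
    """Hypothetical Q: index of delta on a logarithmic grid, 8 steps per octave."""
    return nint(8.0 * math.log2(delta))

def dequantize_scale(index):
    """Hypothetical inverse of Q: the scale factor represented by a grid index."""
    return 2.0 ** (index / 8.0)

def toy_cost(q_coef, q_scale):
    """Stand-in for FOO: sign bit plus magnitude bits of the coefficient only."""
    return 1 + abs(q_coef).bit_length()

def rate_loop(coef, energy_threshold, delta_abs, cost, pr_low, pr_high,
              max_iter=32):
    """Binary search on the interpolation constant between two scale factors.

    Returns (quantized coefficient, scale-factor index, alpha), or None if the
    cost never falls inside [pr_low, pr_high] within max_iter iterations.
    """
    delta_0 = math.sqrt(12.0 * energy_threshold)   # threshold scale factor
    a_high, a_low = 1.0, 0.0                       # a_low = 0 is an assumption
    alpha = a_high                                 # assumed starting point
    for _ in range(max_iter):
        delta = delta_0 ** alpha * delta_abs ** (1.0 - alpha)   # W05
        q_scale = quantize_scale(delta)                          # W07
        delta = dequantize_scale(q_scale)
        q_coef = nint(coef / delta)                              # W09
        c = cost(q_coef, q_scale)                                # W11
        if pr_low <= c <= pr_high:                               # W13
            return q_coef, q_scale, alpha
        if c > pr_high:
            a_high = alpha
        else:
            a_low = alpha
        alpha = 0.5 * (a_low + a_high)
    return None
```

As a usage example, `rate_loop(1000.0, 3.0, 600.0, toy_cost, 5, 6)` starts from a threshold scale factor of 6, finds the cost too high, and accepts the result after one halving of the interpolation constant.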
STEREOPHONIC DECODER 

The stereophonic decoder has a very simple structure. Its main functions are reading the incoming
bitstream, decoding all the data, inverse quantization and reconstruction of RIGHT and LEFT channels. The
technique is represented in FIG. 12.

Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16
or DSP32C, and software performing the operations discussed below. Very large scale integration (VLSI)
hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided.



Claims 

A method of coding a digital input signal to provide a coded digital output signal, the method comprising 
the steps of: 

sampling the digital input signal to create a frame of 2N input signal samples; 
analyzing the frame of signal samples with an odd-frequency fast Fourier transform to provide a 
frame of 2N Fourier coefficients; and 

outputting a coded signal comprising samples X(k), each sample X(k) provided by multiplying the

real part of a Fourier coefficient, F(k), by cos t*^^^ ^X^ + N) ]. 